ML Cheat Sheet
- Regression (Predicting a Number): Used when the output is a continuous value (e.g., price, temperature, stock value).
- Linear Regression: Fits a straight line to data.
- Example: Predicting the price of a house based on square footage.
- Syntax:
from pyspark.ml.regression import LinearRegression
lr = LinearRegression(featuresCol="features", labelCol="price")
model = lr.fit(train_data)
- Classification (Predicting a Category): Used when the output is a discrete label (e.g., Yes/No, Red/Blue/Green).
- Logistic Regression: Despite the name, it's for classification. Predicts the probability of a class.
- Example: Predicting if a customer will "churn" (leave) or stay.
- Syntax: from pyspark.ml.classification import LogisticRegression
- Naive Bayes: Based on Bayes' Theorem. Assumes features are independent.
- Example: Email Spam filtering or Sentiment Analysis (Positive/Negative).
- Syntax: from pyspark.ml.classification import NaiveBayes
- Random Forest: A "forest" of many decision trees. Very robust and popular.
- Example: Predicting if a loan application is "High Risk" or "Low Risk."
- Syntax: from pyspark.ml.classification import RandomForestClassifier
- Clustering (Finding Hidden Groups): Unsupervised learning; the data has no labels, and the model finds patterns on its own.
- K-Means: Groups data points into 'K' number of clusters based on similarity.
- Example: Grouping users by their "shopping persona" (e.g., Bargain Hunters vs. Luxury Buyers).
- Syntax:
from pyspark.ml.clustering import KMeans
kmeans = KMeans(k=5, seed=1)
model = kmeans.fit(dataset)
- Time Series (Predicting the Future): Used for data ordered by time (daily sales, hourly sensor readings).
- ARIMA / SARIMA:
- ARIMA: (AutoRegressive Integrated Moving Average) looks at past values and past errors.
- SARIMA: Includes Seasonality (e.g., sales always spike in December).
- Example: Forecasting electricity demand for the next 48 hours.
- Note: PySpark MLlib doesn't have a native ARIMA. You typically use statsmodels inside a Spark pandas_udf to run it in parallel.
- Syntax (statsmodels):
from statsmodels.tsa.statespace.sarimax import SARIMAX
model = SARIMAX(data, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
results = model.fit()
- Recommendation Engines
- ALS (Alternating Least Squares): A type of Collaborative Filtering.
- Example: Netflix "Because you watched..." or Amazon "Users who bought this also bought..."
- Syntax:
from pyspark.ml.recommendation import ALS
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating")
model = als.fit(train_data)
Quick Comparison Table

| Technique | Goal | Type | Library (PySpark) |
|---|---|---|---|
| Linear Regression | Predict a number | Supervised | pyspark.ml.regression |
| Logistic Regression | Predict a category (0 or 1) | Supervised | pyspark.ml.classification |
| Naive Bayes | Classify text/labels | Supervised | pyspark.ml.classification |
| K-Means | Find hidden groups | Unsupervised | pyspark.ml.clustering |
| ARIMA/SARIMA | Forecast future time steps | Stats/Time Series | statsmodels (via Pandas UDF) |
| Random Forest | High-accuracy classification | Supervised | pyspark.ml.classification |

The PySpark Workflow Pattern
In PySpark, almost every ML task follows the same 3-step pattern (sketched after this list):
- VectorAssembler: Combine your feature columns into a single "features" vector column.
- Fit: Train the model: model = algorithm.fit(df).
- Transform: Make predictions: predictions = model.transform(new_df).
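A minimal sketch of the pattern, assuming a Spark DataFrame df with a numeric sqft feature column and a price label (hypothetical names):
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
# 1. Assemble raw feature columns into a single vector column
assembler = VectorAssembler(inputCols=["sqft"], outputCol="features")
df_vec = assembler.transform(df)
# 2. Fit: train the model on the assembled data
model = LinearRegression(featuresCol="features", labelCol="price").fit(df_vec)
# 3. Transform: append a "prediction" column
predictions = model.transform(df_vec)
The same assemble/fit/transform shape applies to the classifiers and clustering estimators above.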
ML Frameworks: scikit-learn, TensorFlow, PyTorch, and More
scikit-learn (sklearn)
- Best for: Tabular data, classical ML (regression, classification, clustering, preprocessing, pipelines).
- Strengths: Simple API, fast prototyping, tons of built-in models and metrics, great for interviews and competitions.
- Limitations: Not for deep learning, limited GPU support, not for large-scale distributed training.
- Typical workflow:
- Preprocess (LabelEncoder, OneHotEncoder, StandardScaler, etc.)
- Split data (train_test_split)
- Fit model (model.fit)
- Predict/evaluate (model.predict, accuracy_score, confusion_matrix)
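A minimal sketch of that workflow, assuming X is a numeric feature matrix and y a vector of class labels:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)           # fit preprocessing on train only
model = LogisticRegression().fit(scaler.transform(X_train), y_train)
preds = model.predict(scaler.transform(X_test))  # reuse the fitted scaler on test
print(accuracy_score(y_test, preds))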
TensorFlow (and Keras)
- Best for: Deep learning (neural networks, CNNs, RNNs, transformers), large-scale data, production deployment.
- Strengths: GPU/TPU support, scalable, flexible, Keras API is user-friendly, used in industry for image, text, audio, tabular, and time series.
- Limitations: More complex than sklearn for simple tasks, steeper learning curve for custom models.
- Typical workflow:
- Define model (Sequential or Functional API)
- Compile (optimizer, loss, metrics)
- Fit (model.fit)
- Evaluate/predict (model.evaluate, model.predict)
PyTorch
- Best for: Deep learning research, custom neural networks, NLP, computer vision, academic work.
- Strengths: Dynamic computation graph (easier debugging), Pythonic, strong community, used in research and production.
- Limitations: Slightly more code for basic tasks than Keras, but more flexible for advanced models.
- Typical workflow:
- Define model (nn.Module)
- Define loss/optimizer
- Training loop (forward, backward, step)
- Evaluate/predict
Other ML Libraries
- XGBoost/LightGBM/CatBoost:
- Specialized for gradient boosting (tabular data, competitions, high accuracy)
- Often outperform sklearn’s GradientBoostingClassifier/Regressor
- Used via their own API or sklearn wrappers (sketch after this list)
- statsmodels:
- For statistical models (linear regression, ARIMA, time series, GLM)
- More statistical tests, summary tables, p-values (see the OLS sketch below)
- spaCy/NLTK:
- For NLP tasks (tokenization, parsing, entity recognition; see the spaCy sketch below)
- Prophet:
- For time series forecasting (easy API, handles seasonality/holidays; see the Prophet sketch below)
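A minimal XGBoost sketch via its sklearn-compatible wrapper, assuming xgboost is installed and X_train/X_test, y_train/y_test come from a split like the one shown earlier:
from xgboost import XGBClassifier
model = XGBClassifier(n_estimators=200, learning_rate=0.1)  # common knobs to tune
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on held-out data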
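A minimal statsmodels sketch showing the summary output the bullet refers to, assuming numeric X and y:
import statsmodels.api as sm
X_const = sm.add_constant(X)        # add an intercept term
results = sm.OLS(y, X_const).fit()  # ordinary least squares
print(results.summary())            # coefficients, p-values, confidence intervals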
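A minimal spaCy sketch, assuming the small English model has been downloaded (python -m spacy download en_core_web_sm):
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is opening a new office in Berlin.")
print([token.text for token in doc])                 # tokenization
print([(ent.text, ent.label_) for ent in doc.ents])  # named entities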
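A minimal Prophet sketch, assuming a DataFrame df with Prophet's required ds (dates) and y (values) columns:
from prophet import Prophet
m = Prophet()                                 # models trend, seasonality, holidays
m.fit(df)
future = m.make_future_dataframe(periods=30)  # extend 30 periods past the data
forecast = m.predict(future)                  # yhat plus uncertainty bounds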
When to Use What?
| Framework | Best For | Not For |
|---|---|---|
| scikit-learn | Tabular/classical ML, fast prototyping | Deep learning, images |
| TensorFlow | Deep learning, production, scale | Small tabular problems |
| PyTorch | Deep learning research, NLP, CV | Simple tabular ML |
| XGBoost/LGBM | Tabular, competitions, accuracy | Deep learning, images |
| statsmodels | Statistical analysis, time series | Deep learning |
| spaCy/NLTK | NLP preprocessing, pipelines | Tabular, vision |
| Prophet | Time series forecasting | Classification |
General Advice
- Use scikit-learn for most tabular ML tasks, interviews, and quick experiments.
- Use TensorFlow/Keras or PyTorch for deep learning (images, text, audio, complex models).
- Use XGBoost/LightGBM for tabular data when you want the best accuracy (after trying sklearn models).
- Use statsmodels for statistical analysis, time series, and when you need interpretability (p-values, confidence intervals).
- For NLP, use spaCy for pipelines, NLTK for classic NLP, and transformers (HuggingFace) for state-of-the-art models.
Example: Keras Neural Network (TensorFlow)
from tensorflow import keras
from tensorflow.keras import layers
model = keras.Sequential([
layers.Dense(64, activation='relu', input_shape=(X.shape[1],)),
layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10, batch_size=32)
model.evaluate(X_test, y_test)
Example: PyTorch Neural Network
import torch
import torch.nn as nn
import torch.optim as optim
class Net(nn.Module):
def __init__(self, input_dim):
super().__init__()
self.fc1 = nn.Linear(input_dim, 64)
self.relu = nn.ReLU()
self.fc2 = nn.Linear(64, 1)
self.sigmoid = nn.Sigmoid()
def forward(self, x):
x = self.relu(self.fc1(x))
x = self.sigmoid(self.fc2(x))
return x
model = Net(X.shape[1])
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters())
# Training loop omitted for brevity
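A minimal sketch of the loop that comment refers to, assuming X_train and y_train are float tensors with y_train shaped (n, 1) to match the sigmoid output:
for epoch in range(10):
    optimizer.zero_grad()               # reset accumulated gradients
    outputs = model(X_train)            # forward pass
    loss = criterion(outputs, y_train)  # binary cross-entropy
    loss.backward()                     # backpropagate
    optimizer.step()                    # update weights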
Practical ML Model Cheat Sheet (Interview/Assessment Prep)
General ML Workflow (Tabular Data)
- Load Data: Read your data into a DataFrame.
- Define Features/Target: Select feature columns (X) and target column (y).
- Preprocess: Encode non-numeric data if needed (LabelEncoder, OneHotEncoder).
- Split: Use train_test_split for train/test sets.
- Fit: Train your model (fit on X_train, y_train).
- Score/Evaluate: Use score(), accuracy_score, or other metrics on X_test, y_test.
Random Forest (RF) vs Gradient Boosting (GB)
Random Forest:
- Ensemble of many decision trees, built in parallel (bagging).
- Robust, less prone to overfitting, fast to train, fewer hyperparameters.
- Good baseline for tabular data.
Gradient Boosting:
- Ensemble of trees built sequentially, each correcting the previous (boosting).
- More tunable parameters (learning rate, n_estimators, etc.), can overfit if not tuned.
- Often achieves higher accuracy if tuned well, but slower to train.
When to use:
- Try RF for a quick, robust baseline.
- Use GB if you want to push for best accuracy and can tune parameters.
Typical usage (scikit-learn):
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
rf.score(X_test, y_test)
gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)
gb.score(X_test, y_test)
Naive Bayes
What is it?
- Probabilistic classifier based on Bayes’ theorem, assumes features are independent given the class.
- Very fast, works well for text classification (spam, sentiment, etc.).
Types:
- GaussianNB: for continuous features.
- MultinomialNB: for counts (text, word counts).
- BernoulliNB: for binary features.
When to use:
- Text classification, high-dimensional data, simple/fast baseline.
Typical usage (text):
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
model = MultinomialNB()
model.fit(X, y)
model.score(X, y)  # accuracy on the training data; evaluate on a held-out split in practice
Metrics & Confusion Matrix
Confusion Matrix:
- For binary classification: [[TN, FP], [FN, TP]]
- Recall: TP / (TP + FN)
- False Positive Rate: FP / (FP + TN)
- Accuracy: (TP + TN) / (TP + TN + FP + FN)
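A quick sketch of these metrics with scikit-learn, assuming true labels y_test and predictions preds:
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()  # binary case: [[TN, FP], [FN, TP]]
print(recall_score(y_test, preds))    # TP / (TP + FN)
print(fp / (fp + tn))                 # false positive rate
print(accuracy_score(y_test, preds))  # (TP + TN) / total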
PCA vs LDA
PCA (Principal Component Analysis):
- Unsupervised, reduces dimensionality by maximizing variance.
- Use when you want to reduce features, handle multicollinearity, or don’t have labels.
LDA (Linear Discriminant Analysis):
- Supervised, reduces dimensionality by maximizing class separability.
- Use when you want to separate classes and have labels.
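A minimal sketch of both reducers, assuming numeric X and class labels y:
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
X_pca = PCA(n_components=2).fit_transform(X)      # unsupervised: ignores labels
lda = LinearDiscriminantAnalysis(n_components=1)  # at most n_classes - 1 components
X_lda = lda.fit_transform(X, y)                   # supervised: uses labels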
General Tips
- Most ML models require numeric features; encode categorical/text columns first.
- If the target is free-form text rather than a fixed set of categories, use NLP/deep learning models, not RF/GB/NB.
- Always preprocess test data with the exact steps/vectorizer fitted on the training data.
Example: Naive Bayes Text Classification
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
texts = ["I love ML", "ML is great", "I hate spam", "spam is bad"]
labels = [1, 1, 0, 0]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
model = MultinomialNB()
model.fit(X, labels)
model.predict(vectorizer.transform(["payment plan"]))  # reuse the fitted vectorizer; words unseen in training become all zeros
Model Selection Table
| Model | Use Case | Data Type | Pros | Cons |
|---|---|---|---|---|
| Linear Regression | Predict a number | Numeric | Simple, interpretable | Only linear relationships |
| Logistic Regression | Predict a category (0/1) | Numeric/categorical | Probabilities, fast | Only linear boundaries |
| Naive Bayes | Text/category classification | Text/categorical | Fast, works for text | Strong independence assumption |
| Random Forest | Classification/regression | Tabular | Robust, less overfitting | Slower, less interpretable |
| Gradient Boosting | Classification/regression | Tabular | High accuracy, flexible | Slow, needs tuning |
| K-Means | Clustering | Numeric | Unsupervised, simple | Needs k, only spherical clusters |
| ARIMA/SARIMA | Time series forecasting | Time series | Handles trends/seasonality | Needs stationary data |